System and Method for Independent, Direct and Parallel Communication Among Multiple Field Programmable Gate Arrays

ABSTRACT

Representative embodiments are disclosed for data transfer between field programmable gate arrays (FPGAs). A representative system includes: a PCIe communication network comprising a PCIe switch and a plurality of PCIe communication lines; a host computing system coupled to the PCIe communication network; a nonblocking crossbar switch; a plurality of memory circuits; and a plurality of field programmable gate arrays, each field programmable gate array configurable for a plurality of data transfers to and from the host computing system and any other field programmable gate array of the plurality of field programmable gate arrays, with each data transfer including a designation of a first memory address, a file size, and a stream number. Once base DMA registers have been initialized for a selected application, no further involvement by the host computing system is required for the duration of the selected application.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of and claims the benefit of and priority to U.S. patent application Ser. No. 15/669,136, filed Aug. 4, 2017, inventors Gregory M. Edvenson et al., titled “System and Method for Independent, Direct and Parallel Communication Among Multiple Field Programmable Gate Arrays”, which is a continuation of and claims the benefit of and priority to U.S. patent application Ser. No. 14/608,464, filed Jan. 29, 2015 and issued Aug. 8, 2017 as U.S. Pat. No. 9,727,510, inventors Gregory M. Edvenson et al., titled “System and Method for Independent, Direct and Parallel Communication Among Multiple Field Programmable Gate Arrays”, which is a nonprovisional of and, under 35 U.S.C. Section 119, further claims the benefit of and priority to U.S. Provisional Patent Application No. 61/940,009, filed Feb. 14, 2014, inventors Jeremy B. Chritz et al., titled “High Speed, Parallel Configuration of Multiple Field Programmable Gate Arrays”, which is commonly assigned herewith, the entire contents of which are incorporated herein by reference with the same full force and effect as if set forth in its entirety herein, and with priority claimed for all commonly disclosed subject matter.

U.S. patent application Ser. No. 14/608,464 also is a nonprovisional of and, under 35 U.S.C. Section 119, further claims the benefit of and priority to U.S. Provisional Patent Application No. 61/940,472, filed Feb. 16, 2014, inventors Jeremy B. Chritz et al., titled “System and Method for Independent, Direct and Parallel Communication Among Multiple Field Programmable Gate Arrays”, which is commonly assigned herewith, the entire contents of which are incorporated herein by reference with the same full force and effect as if set forth in its entirety herein, and with priority claimed for all commonly disclosed subject matter.

U.S. patent application Ser. No. 14/608,464 is a continuation-in-part of and further claims priority to U.S. patent application Ser. No. 14/213,495, filed Mar. 14, 2014 and issued Aug. 22, 2017 as U.S. Pat. No. 9,740,798, inventors Paul T. Draghicescu, Gregory M. Edvenson, and Corey B. Olson, titled “Inexact Search Acceleration”, which is a continuation-in-part of and further claims priority to U.S. patent application Ser. No. 14/201,824, filed Mar. 8, 2014 and issued Aug. 15, 2017 as U.S. Pat. No. 9,734,284, inventor Corey B. Olson, titled “Hardware Acceleration of Short Read Mapping for Genomic and Other Types of Analyses”, both of which further claim priority to and the benefit of U.S. Provisional Patent Application No. 61/940,472 and U.S. Provisional Patent Application No. 61/940,009 as referenced above, and further claim priority to and the benefit under 35 U.S.C. Section 119 of U.S. Provisional Patent Application No. 61/790,407, filed Mar. 15, 2013, inventor Corey B. Olson, titled “Hardware Acceleration of Short Read Mapping”, and of U.S. Provisional Patent Application No. 61/790,720, filed Mar. 15, 2013, inventors Paul T. Draghicescu, Gregory M. Edvenson, and Corey B. Olson, titled “Inexact Search Acceleration on FPGAs Using the Burrows-Wheeler Transform”, which are commonly assigned herewith, the entire contents of which are incorporated herein by reference with the same full force and effect as if set forth in their entireties herein, and with priority claimed for all commonly disclosed subject matter.

U.S. patent application Ser. No. 14/608,464 is a continuation-in-part of and further claims priority to U.S. patent application Ser. No. 14/201,824, filed Mar. 8, 2014 and issued Aug. 15, 2017 as U.S. Pat. No. 9,734,284, inventor Corey B. Olson, titled “Hardware Acceleration of Short Read Mapping for Genomic and Other Types of Analyses”, which further claims priority to and the benefit of U.S. Provisional Patent Application No. 61/940,472, U.S. Provisional Patent Application No. 61/940,009, U.S. Provisional Patent Application No. 61/790,407, and U.S. Provisional Patent Application No. 61/790,720 as referenced above, which are commonly assigned herewith, the entire contents of which are incorporated herein by reference with the same full force and effect as if set forth in their entireties herein, and with priority claimed for all commonly disclosed subject matter.

FIELD OF THE INVENTION

The present invention relates generally to computing systems, and more specifically to the independent, direct and parallel communication among multiple configurable logic circuits such as a plurality of FPGAs.

BACKGROUND

Communication among configurable logic circuits such as field programmable gate arrays (“FPGAs”) typically involves considerable host system involvement, which is highly undesirable and often unacceptable for supercomputing applications. For example, Xilinx FPGAs typically require host involvement in setting up registers for every data transfer.

In addition, supercomputing applications would be served advantageously by parallel involvement of multiple FPGAs capable of operating independently and without extensive host involvement.

Accordingly, a need remains for a system having both hardware and software co-design to provide for independent, direct and parallel communication among multiple configurable logic circuits such as a plurality of FPGAs. Such a system should further provide for minimal host involvement, and for significantly parallel and rapid data transfers, including to and from memory located anywhere within the system.

SUMMARY OF THE INVENTION

The exemplary embodiments of the present invention provide numerous advantages. Exemplary embodiments provide for direct FPGA-to-FPGA data transfers in a system without involvement of the host computing system. This allows for independent, direct and parallel communication among multiple configurable logic circuits such as a plurality of FPGAs.

A representative embodiment includes a system couplable to a host computing system, with the system comprising: a PCIe communication network comprising a PCIe switch and a plurality of PCIe communication lines; a plurality of memory circuits; and a plurality of field programmable gate arrays, each field programmable gate array coupled to the PCIe communication network and to at least one memory circuit of the plurality of memory circuits, each field programmable gate array configurable for a plurality of data transfers to any other field programmable gate array of the plurality of field programmable gate arrays, each data transfer including a designation of a first memory address and a stream number. As an option, each data transfer designation may further comprise a file size.

A representative embodiment may further include at least one tertiary field programmable gate array configured as a non-blocking crossbar switch and coupled to the plurality of field programmable gate arrays. In a representative embodiment, each data transfer is through the PCIe communication network or through the nonblocking crossbar switch. In a representative embodiment, each data transfer occurs without involvement of the host computing system.

In a representative embodiment, prior to commencement of a computing application, the host computing system transmits a plurality of DMA register messages to one or more field programmable gate arrays of the plurality of field programmable gate arrays, each DMA register message designating a memory address of the plurality of memory circuits, the file size, and the stream number. In such a representative embodiment, each DMA register maintains its designations until another DMA register message changing the designations is received.
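
The persistence of DMA register designations described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the disclosed implementation: the class names `DmaRegisterMessage` and `FpgaDmaRegisters`, the keying of registers by stream number, and the example addresses are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DmaRegisterMessage:
    """Hypothetical message: a memory address, a file size, and a stream number."""
    memory_address: int
    file_size: int
    stream_number: int


class FpgaDmaRegisters:
    """Toy model of the DMA registers held by one FPGA (assumed: one per stream)."""

    def __init__(self) -> None:
        self._by_stream: Dict[int, DmaRegisterMessage] = {}

    def apply(self, message: DmaRegisterMessage) -> None:
        # A later message for the same stream replaces the earlier designations.
        self._by_stream[message.stream_number] = message

    def lookup(self, stream_number: int) -> DmaRegisterMessage:
        # Designations persist until overwritten, so repeated lookups agree.
        return self._by_stream[stream_number]


registers = FpgaDmaRegisters()
registers.apply(DmaRegisterMessage(memory_address=0x1000, file_size=4096, stream_number=3))
first = registers.lookup(3)
second = registers.lookup(3)

# Only a new DMA register message changes the designations.
registers.apply(DmaRegisterMessage(memory_address=0x8000, file_size=512, stream_number=3))
updated = registers.lookup(3)
```

In this sketch the host would call `apply` once per stream before the application starts; every later transfer on that stream reuses the stored designations without further host action.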

Each data transfer may further include a designation of a second memory address and a tie stream number. In such a representative embodiment, in response to receiving a data transfer including designation of the second memory address and the tie stream number, each field programmable gate array is configurable to forward the data transferred to the second memory address and the tie stream number.
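
The forwarding behavior tied to a second memory address and tie stream number can be sketched as follows. This is a hedged illustration only: the `DataTransfer` and `Fpga` classes, the `outbox` list standing in for the outgoing link, and the choice to also store the payload at the first memory address are assumptions not stated in the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DataTransfer:
    """Hypothetical transfer record; the last two fields are optional."""
    payload: bytes
    first_address: int
    stream_number: int
    second_address: Optional[int] = None
    tie_stream_number: Optional[int] = None


class Fpga:
    """Toy FPGA that stores received data and forwards it when a tie stream is set."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.memory: Dict[int, bytes] = {}
        self.outbox: List[DataTransfer] = []  # stands in for the outgoing link

    def receive(self, transfer: DataTransfer) -> None:
        # Assumption: the payload is first placed at the first memory address.
        self.memory[transfer.first_address] = transfer.payload
        has_tie = (transfer.second_address is not None
                   and transfer.tie_stream_number is not None)
        if has_tie:
            # Forward the same data to the second address on the tie stream.
            self.outbox.append(DataTransfer(
                payload=transfer.payload,
                first_address=transfer.second_address,
                stream_number=transfer.tie_stream_number,
            ))


fpga = Fpga("fpga0")
fpga.receive(DataTransfer(b"abc", first_address=0x10, stream_number=1))
fpga.receive(DataTransfer(b"xyz", first_address=0x20, stream_number=2,
                          second_address=0x40, tie_stream_number=7))
```

Only the second transfer carries both optional designations, so only it produces a forwarded transfer.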

A representative embodiment may further include a plurality of data communication lines coupling the plurality of field programmable gate arrays in series, and wherein one or more data transfers occur directly through the plurality of data communication lines and without involvement of the host computing system.

Another representative embodiment includes a system couplable to a host computing system, with the system comprising: a PCIe communication network comprising a PCIe switch and a plurality of PCIe communication lines; a nonblocking crossbar switch; a plurality of memory circuits; and a plurality of field programmable gate arrays, each field programmable gate array coupled to the PCIe communication network, to the nonblocking crossbar switch, and to at least one memory circuit of the plurality of memory circuits, each field programmable gate array configurable for a plurality of data transfers to any other field programmable gate array of the plurality of field programmable gate arrays, each data transfer including a designation of a first memory address and a stream number. As an option, each data transfer designation may further comprise a file size.

In a representative embodiment, each data transfer is through the PCIe communication network or through the nonblocking crossbar switch, and each data transfer occurs without involvement of the host computing system.

Also in a representative embodiment, prior to commencement of a computing application, the host computing system initializes the system and transmits a plurality of DMA register messages to one or more field programmable gate arrays of the plurality of field programmable gate arrays, each DMA register message designating a memory address of the plurality of memory circuits, the file size, and the stream number. Each DMA register maintains its designations until another DMA register message changing the designations is received.

In another representative embodiment, each data transfer further includes a designation of a second memory address and a tie stream number. In response to receiving a data transfer including designation of the second memory address and the tie stream number, each field programmable gate array is configurable to forward the data transferred to the second memory address and the tie stream number.

In a representative embodiment, the nonblocking crossbar switch is implemented using a selected field programmable gate array of the plurality of field programmable gate arrays. In another representative embodiment, the nonblocking crossbar switch is implemented using the PCIe switch or a second PCIe switch.

In another representative embodiment, a system comprises: a PCIe communication network comprising a PCIe switch and a plurality of PCIe communication lines; a host computing system coupled to the PCIe communication network; a nonblocking crossbar switch; a plurality of memory circuits; and a plurality of field programmable gate arrays, each field programmable gate array coupled to the PCIe communication network, to the nonblocking crossbar switch, and to at least one memory circuit of the plurality of memory circuits, each field programmable gate array configurable for a plurality of data transfers to and from the host computing system and any other field programmable gate array of the plurality of field programmable gate arrays, each data transfer including a designation of a first memory address, a file size, and a stream number.

In another representative embodiment, a system comprises: a PCIe communication network comprising a PCIe switch and a plurality of PCIe communication lines; a host computing system coupled to the PCIe communication network; a first field programmable gate array configurable as a nonblocking crossbar switch; a plurality of memory circuits; and a plurality of second field programmable gate arrays, each second field programmable gate array coupled to the PCIe communication network, to the first field programmable gate array configurable as a nonblocking crossbar switch, and to at least one memory circuit of the plurality of memory circuits, each second field programmable gate array configurable for a plurality of data transfers to and from the host computing system and any other second field programmable gate array of the plurality of second field programmable gate arrays, each data transfer including a designation of a first memory address, a file size, and a stream number.

In another representative embodiment, a method of data transfer in a system is disclosed, the system comprising a host computing system, a PCIe communication network, a nonblocking crossbar switch, a plurality of memory circuits, and a plurality of field programmable gate arrays, with the method comprising: using the host computing system, transmitting a plurality of DMA register messages to one or more field programmable gate arrays of the plurality of field programmable gate arrays, each DMA register message designating a first memory address of the plurality of memory circuits, a file size, and a stream number; and using at least one field programmable gate array of the plurality of field programmable gate arrays, transferring data to any other field programmable gate array of the plurality of field programmable gate arrays, each data transfer including a designation of a selected first memory address, file size, and stream number.

Numerous other advantages and features of the present invention will become readily apparent from the following detailed description of the invention and the embodiments thereof, from the claims and from the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The objects, features and advantages of the present invention will be more readily appreciated upon reference to the following disclosure when considered in conjunction with the accompanying drawings, wherein like reference numerals are used to identify identical components in the various views, and wherein reference numerals with alphabetic characters are utilized to identify additional types, instantiations or variations of a selected component embodiment in the various views, in which:

FIG. 1 is a block diagram illustrating an exemplary or representative first system embodiment.

FIG. 2 is a block diagram illustrating an exemplary or representative second system embodiment.

FIG. 3 is a block diagram illustrating an exemplary or representative third system embodiment.

FIG. 4 is a block diagram illustrating an exemplary or representative fourth system embodiment.

FIG. 5 is a flow diagram illustrating an exemplary or representative configuration method embodiment.

FIG. 6 is a block diagram illustrating exemplary or representative fields for a (stream) packet header.

FIG. 7 is a flow diagram illustrating an exemplary or representative communication method embodiment.

DETAILED DESCRIPTION OF REPRESENTATIVE EMBODIMENTS

While the present invention is susceptible of embodiment in many different forms, there are shown in the drawings and will be described herein in detail specific exemplary embodiments thereof, with the understanding that the present disclosure is to be considered as an exemplification of the principles of the invention and is not intended to limit the invention to the specific embodiments illustrated. In this respect, before explaining at least one embodiment consistent with the present invention in detail, it is to be understood that the invention is not limited in its application to the details of construction and to the arrangements of components set forth above and below, illustrated in the drawings, or as described in the examples. Methods and apparatuses consistent with the present invention are capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein, as well as the abstract included below, are for the purposes of description and should not be regarded as limiting.

FIG. 1 is a block diagram illustrating an exemplary or representative first system 100 embodiment. FIG. 2 is a block diagram illustrating an exemplary or representative second system 200 embodiment. FIG. 3 is a block diagram illustrating an exemplary or representative third system 300 embodiment and first apparatus embodiment. FIG. 4 is a block diagram illustrating an exemplary or representative fourth system 400 embodiment.

As illustrated in FIGS. 1-4, the systems 100, 200, 300, 400 include one or more host computing systems 105, such as a computer or workstation, having one or more central processing units (CPUs) 110, which may be any type of processor, and host memory 120, which may be any type of memory, such as a hard drive or a solid state drive, and which may be located with or separate from the host CPU 110, all for example and without limitation, and as discussed in greater detail below. The memory 120 typically stores data to be utilized in or generated by a selected application and also generally a configuration bit file or image for a selected application. Not separately illustrated, any of the host computing systems 105 may include a plurality of different types of processors, such as graphics processors, multi-core processors, etc., also as discussed in greater detail below. The various systems 100, 200, 300, 400 differ from one another in terms of the arrangements of circuit components (including on or in various modules), types of components, and types of communication between and among the various components, as described in greater detail below.

The one or more host computing systems 105 are typically coupled through one or more communication channels or lines, illustrated as PCI express (Peripheral Component Interconnect Express or “PCIe”) lines 130, either directly or through a PCIe switch 125, to one or more configurable logic elements such as one or more FPGAs 150 (including FPGAs 160, 170) (such as a Spartan 6 FPGA or a Kintex-7 FPGA, both available from Xilinx, Inc. of San Jose, Calif., US, or a Stratix 10 or Cyclone V FPGA available from Altera Corp. of San Jose, Calif., US, for example and without limitation), each of which in turn is coupled to a nonvolatile memory 140, such as a FLASH memory (such as for storing configuration bit images), and to a plurality of random access memories 190, such as a plurality of DDR3 (SODIMM) memory integrated circuits, such as for data storage for computation, communication, etc., for example and without limitation. In a first embodiment as illustrated, each FPGA 150 and corresponding memories 140, 190 directly coupled to that FPGA 150 are collocated on a corresponding computing module (or circuit board) 175 as a module or board in a rack mounted system having many such computing modules 175, such as those available from Pico Computing of Seattle, Washington US. As illustrated, each computing module 175 includes as an option PCIe input and output (I/O) connector(s) 230 to provide the PCIe 130 connections, such as for a rack mounted system. In representative embodiments, the I/O connector(s) 230, 235 may also include additional coupling functionality, such as JTAG coupling, input power, ground, etc., for example and without limitation, and are illustrated with such additional connectivity in FIG. 4. The PCIe switch 125 may be located or positioned anywhere in a system 100, 200, 300, 400, such as on a separate computing module (such as a backplane circuit board, which can be implemented with computing module 195, for example), or on any of the computing modules 175, 180, 185, 195, 115 for example and without limitation. In addition, other types of communication lines or channels may be utilized to couple the one or more host computing systems 105 to the FPGAs 150, such as an Ethernet line, which in turn may be coupled to other intervening rack-mounted components to provide communication to and from one or more FPGAs 150 (160, 170) and other modules. Also in addition, the various FPGAs 150 (160, 170) may have additional or alternative types of communication between and among the PCIe switch 125 and other FPGAs 150 (160, 170), such as via general purpose (GP) I/O lines 131 (illustrated in FIG. 4).

PCIe switch 125 (e.g., available from PLX Technology, Inc. of Sunnyvale, Calif., US), or one or more of the FPGAs 150 (160, 170), may also be configured (as an option) as one or more non-blocking crossbar switches 220, illustrated in FIG. 1 as part of (or a configuration of) PCIe switch 125. The non-blocking crossbar switch 220 provides for pairwise and concurrent communication (communication lines 221) between and among the FPGAs 150, 160, 170 and any of various memories (120, 190, for example and without limitation), without communication between any given pair of FPGAs 150, 160, 170 blocking any other communication between another pair of FPGAs 150, 160, 170. In an exemplary embodiment, one or more non-blocking crossbar switches 220 are provided (within a PCIe switch 125) to have sufficient capacity to enable direct FPGA to FPGA communication between and among all of the FPGAs 150, 160, 170 in a selected portion of the system 100, 200, 300, 400. In another representative embodiment, one or more non-blocking crossbar switches 220 are implemented using one or more FPGAs 150 which have been configured accordingly, as illustrated in FIG. 2, which may also be considered a tertiary (or third) FPGA 150 when included in the various hierarchical embodiments, such as illustrated in FIG. 2. In another representative embodiment, one or more non-blocking crossbar switches 220 are implemented using one or more PCIe switches 125 which also have been configured accordingly, illustrated as second PCIe switch 125A in FIG. 4. In another exemplary embodiment not separately illustrated, one or more non-blocking crossbar switches 220 are provided internally within any of the one or more FPGAs 150, 160, 170 for concurrent accesses to a plurality of memories 190, for example and without limitation.
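
The non-blocking property described above, in which communication between one pair of FPGAs does not block another pair, can be illustrated with a small Python model. This is a sketch under assumptions rather than a description of switch 220 itself: the class `NonBlockingCrossbar`, its methods, and the rule that each port carries one connection at a time are hypothetical.

```python
from typing import Dict, Iterable


class NonBlockingCrossbar:
    """Toy crossbar: any two idle ports can be connected, regardless of other pairs."""

    def __init__(self, ports: Iterable[str]) -> None:
        self.ports = set(ports)
        self.active: Dict[str, str] = {}  # source port -> destination port

    def connect(self, source: str, destination: str) -> bool:
        if source not in self.ports or destination not in self.ports:
            raise ValueError("unknown port")
        if source == destination:
            raise ValueError("source and destination must differ")
        busy = set(self.active) | set(self.active.values())
        # Assumption: a port already in use cannot join a second connection.
        if source in busy or destination in busy:
            return False
        # Connections between other pairs never prevent this one.
        self.active[source] = destination
        return True

    def disconnect(self, source: str) -> None:
        self.active.pop(source, None)


crossbar = NonBlockingCrossbar(["fpga0", "fpga1", "fpga2", "fpga3"])
pair_a = crossbar.connect("fpga0", "fpga1")
pair_b = crossbar.connect("fpga2", "fpga3")   # concurrent with pair_a
conflict = crossbar.connect("fpga0", "fpga2")  # both endpoints already in use
crossbar.disconnect("fpga0")
crossbar.disconnect("fpga2")
after_release = crossbar.connect("fpga0", "fpga2")
```

The two disjoint pairs connect at the same time; a request is refused only when one of its own endpoints is occupied, never because of traffic between other ports.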

Referring to FIG. 2, the system 200 differs insofar as the various FPGAs are hierarchically organized into one or more primary (or central) configurable logic elements such as one or more primary FPGAs 170 and a plurality of secondary (or remote) configurable logic elements such as one or more secondary FPGAs 160 (FPGAs 150, 160, 170 may be any type of configurable logic elements (such as a Spartan 6 FPGA, a Kintex-7 FPGA, a Stratix 10, a Cyclone V FPGA as mentioned above), also for example and without limitation). The one or more host computing systems 105 are typically coupled through one or more communication channels or lines, illustrated as PCI express (Peripheral Component Interconnect Express or “PCIe”) lines 130, either directly or through a PCIe switch 125, to primary FPGAs 170, each of which in turn is coupled to a plurality of secondary FPGAs 160, also through one or more corresponding communication channels, illustrated as a plurality of JTAG lines 145 (Joint Test Action Group (“JTAG”) is the common name for the IEEE 1149.1 Standard Test Access Port and Boundary-Scan Architecture), or through any of the PCIe lines 130 or GP I/O lines 131. In this embodiment (illustrated in FIG. 2), each of the secondary FPGAs 160 is provided on a separate computing module 185 which is couplable (through I/O connector(s) 235 and PCIe lines 130 and/or JTAG lines 145) to the computing module 180 having the primary FPGA 170. In various embodiments, the PCIe lines 130 and JTAG lines 145 are illustrated as part of a larger bus (which may also include GP I/O lines 131), and typically routed to different pins on the various FPGAs 150, 160, 170, typically via I/O connectors 235, for example, for the various modular configurations or arrangements. As mentioned above, other lines, such as for power, ground, clocking (in some embodiments), etc., also may be provided to a computing module 185 via I/O connectors 235, for example and without limitation. Not separately illustrated in FIG. 2, PCIe switch 125 also may be coupled to a separate FPGA, such as an FPGA 150, such as illustrated in FIG. 1, which also may be coupled to a nonvolatile memory 140, for example and without limitation.

The PCIe switch 125 may be positioned anywhere in a system 100, 200, 300, 400, such as on a separate computing module, for example and without limitation, or on one or more of the computing modules 180 having the primary FPGA 170, as illustrated in FIG. 4 for computing module 195, which can be utilized to implement a backplane for multiple modules 175, as illustrated. In an exemplary embodiment, due to a significantly large fan out of the PCIe lines 130 to other modules and cards in the various systems 100, 200, 300, 400, the PCIe switch 125 is typically located on the backplane of a rack-mounted system (available from Pico Computing, Inc. of Seattle, Wash. US). A PCIe switch 125 may also be collocated on various computing modules (e.g., 195), to which many other modules (e.g., 175) connect (e.g., through PCIe connector(s) 230 or, more generally, I/O connectors 235 which include PCIe, JTAG, GPIO, power, ground, and other signaling lines). In addition, other types of communication lines or channels may be utilized to couple the one or more host computing systems 105 to the primary FPGAs 170 and/or secondary FPGAs 160, such as an Ethernet line, which in turn may be coupled to other intervening rack-mounted components to provide communication to and from one or more primary FPGAs 170 and other modules.

In this system 200 embodiment, the primary and secondary FPGAs 170 and 160 are located on separate computing modules 180 and 185, also in a rack mounted system having many such computing modules 180 and 185, also such as those available from Pico Computing of Seattle, Wash. US. The computing modules 180 and 185 may be coupled to each other via any type of communication lines, including PCIe and/or JTAG. For example, in an exemplary embodiment, each of the secondary FPGAs 160 is located on a modular computing module (or circuit board) 185 which has corresponding I/O connectors 235 to plug into a region or slot of the primary FPGA 170 computing module 180, up to the capacity of the primary FPGA 170 computing module 180, such as one to six modular computing modules 185 having secondary FPGAs 160. In representative embodiments, the I/O connector(s) 235 may include a wide variety of coupling functionality, such as JTAG coupling, PCIe coupling, GP I/O, input power, ground, etc., for example and without limitation. For purposes of the present disclosure, systems 100, 200, 300, 400 function similarly, and any and all of these system configurations are within the scope of the disclosure.

Not separately illustrated in FIGS. 1-4, each of the various computing modules 175, 180, 185, 195, 115 typically includes many additional components, such as power supplies, additional memory, additional input and output circuits and connectors, switching components, clock circuitry, etc.

The various systems 100, 200, 300, 400 may also be combined into a plurality of system configurations, such as mixing the different types of FPGAs 150, 160, 170 and computing modules 175, 180, 185, 195, 115 into the same system, including within the same rack-mounted system.

Additional representative system 300, 400 configurations or arrangements are illustrated in FIGS. 3 and 4. In the system 300 embodiment, the primary and secondary FPGAs 150 and 160, along with PCIe switch 125, are all collocated on a dedicated computing module 115 as a large module in a rack mounted system having many such computing modules 115, such as those available from Pico Computing of Seattle, Wash. US. In the system 400 embodiment (illustrated in FIG. 4), each of the secondary FPGAs 160 is provided on a separate computing module 175 which is couplable to the computing module 195 having the primary FPGA 170. PCIe switches 125 are also illustrated as collocated on computing module 195 for communication with secondary FPGAs 160 over PCIe communication lines 130, although this is not required and such a PCIe switch 125 may be positioned elsewhere in a system 100, 200, 300, 400, such as on a separate computing module, for example and without limitation.

The representative system 300 illustrates some additional features which may be included as options in a computing module, and is further illustrated as an example computing module 115 which does not include the optional nonblocking crossbar switch 220 (e.g., in a PCIe switch 125 or as a configuration of an FPGA 150, 160, 170). As illustrated in FIG. 3, the various secondary FPGAs 160 also have direct communication to each other, with each FPGA 160 coupled through communication lines 210 to its neighboring FPGAs 160, such as serially or “daisy-chained” to each other. Also, one of the FPGAs 160, illustrated as FPGA 160A, has been coupled through high speed serial lines 215, to a hybrid memory cube (“HMC”) 205, which incorporates multiple layers of memory and at least one logic layer, with very high memory density capability. For this system 300, the FPGA 160A has been configured as a memory controller (and potentially a switch or router), providing access and communication to and from the HMC 205 for any of the various FPGAs 160, 170.

As a consequence, for purposes of the present disclosure, a system 100, 200, 300, 400 comprises one or more host computing systems 105, couplable through one or more communication lines (such as GP I/O lines 131 or PCIe communication lines 130, directly or through a PCIe switch 125), to one or more FPGAs 150 and/or primary FPGAs 170. In turn, each primary FPGA 170 is coupled through one or more communication lines, such as JTAG lines 145 or PCIe communication lines 130 or GP I/O lines 131, to one or more secondary FPGAs 160. Depending upon the selected embodiment, each FPGA 150, 160, 170 is optionally coupled to a non-blocking crossbar switch 220 (e.g., in a PCIe switch 125 or as a configuration of an FPGA 150, 160, 170) for pairwise communication with any other FPGA 150, 160, 170. In addition, each FPGA 150, 160, 170 is typically coupled to one or more nonvolatile memories 140 and one or more random access memories 190, which may be any type of random access memory.

Significant features are enabled in the system 100, 200, 300, 400 as an option, namely, the highly limited involvement of the host CPU 110 in configuring any and all of the FPGAs 150, 160, 170, which frees the host computing system 105 to be engaged in other tasks. In addition, the configuration of the FPGAs 150, 160, 170 may be performed in a massively parallel process, allowing significant time savings. Moreover, because the full configurations of the FPGAs 150, 160, 170 are not required to be stored in nonvolatile memory 140 (such as FLASH), with corresponding read/write cycles which are comparatively slow, configuration of the FPGAs 150, 160, 170 may proceed at a significantly more rapid rate, including providing new or updated configurations. The various FPGAs 150, 160, 170 may also be configured as known in the art, such as by loading a complete configuration from nonvolatile memory 140.

Another significant feature of the systems 100, 200, 300, 400 is that only basic (or base) resources for the FPGAs 150 or primary FPGAs 170 are stored in the nonvolatile memory 140 (coupled to a FPGA 150 or a primary FPGA 170), such as a configuration for communication over the PCIe lines 130 (and possibly GP I/O lines 131 or JTAG lines 145, such as for secondary FPGAs 160), and potentially also a configuration for one or more DMA engines (depending upon the selected FPGA 150, 160, 170, the FPGA 150, 160, 170 may be available with incorporated DMA engines). As a result, upon system 100, 200, 300, 400 startup, the only configuration required to be loaded into the FPGA 150 or primary FPGA 170 is limited or minimal, namely, communication (e.g., PCIe and possibly JTAG) functionality and/or DMA functionality. In a representative embodiment, upon system 100, 200, 300, 400 startup, the only configuration required to be loaded into the FPGA 150 or a primary FPGA 170 is a communication configuration for PCIe functionality. As a consequence, this base PCIe configuration may be loaded quite rapidly from the nonvolatile memory 140. Stated another way, except for loading of the base communication configuration for PCIe functionality, use of the nonvolatile memory 140 for FPGA configuration is bypassed entirely, both for loading of an initial configuration and for loading of an updated configuration.

Instead of a host CPU 110 “bit banging” or transferring a very large configuration bit image to each FPGA 150 or primary FPGA 170, configuration of the system 100, 200, 300, 400 occurs rapidly and in parallel when implemented in representative embodiments. Configuration of the FPGAs 150 or primary FPGAs 170 and secondary FPGAs 160 begins with the host CPU 110 merely transmitting a message or command to one or more FPGAs 150 or primary FPGAs 170 with a memory address or location in the host memory 120 (and typically also a file size) of the configuration bit image (or file) which has been stored in the host memory 120, i.e., the host CPU 110 sets the DMA registers of the FPGA 150 or primary FPGA 170 with the memory address and file size for the selected configuration bit image (or file) in the host memory 120. Such a “load FPGA” command is repeated for each of the FPGAs 150 or primary FPGAs 170 (and possibly each secondary FPGA 160, depending upon the selected embodiment), i.e., continuing until the host CPU 110 does not find any more FPGAs 150 or primary FPGAs 170 (and/or secondary FPGAs 160) in the system 100, 200, 300, 400 and an error message may be returned. Typically, the host CPU 110 transmits one such message or command to each FPGA 150 or primary FPGA 170 that will be handling a thread of a parallel, multi-threaded computation. In the representative embodiments, the host CPU 110 is then literally done with the configuration process, and is typically notified with an interrupt signal from a FPGA 150 or primary FPGA 170 once configuration is complete. Stated another way, from the perspective of the host computing system 105, following transmission of generally a single message or command having a designation of a memory address (and possibly a file size), the configuration process is complete. This is a huge advance over prior art methods of FPGA configuration in supercomputing systems.
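The division of labor described above may be illustrated with a minimal software sketch. It is a toy model only: the class `Fpga`, its fields, and the methods `load_fpga` and `run_dma` are hypothetical names introduced for illustration, and host memory 120 is stood in for by a Python `bytes` object. The point shown is that the host writes only an address and a file size, and the FPGA then fetches the image itself and signals completion.

```python
from dataclasses import dataclass, field


@dataclass
class Fpga:
    """Toy model (assumption) of an FPGA 150 or primary FPGA 170 with a DMA engine."""
    name: str
    dma_address: int = 0      # DMA register: start address in host memory
    dma_size: int = 0         # DMA register: file size in bytes
    bitstream: bytes = b""    # configuration image currently loaded
    interrupts: list = field(default_factory=list)

    def load_fpga(self, address: int, size: int) -> None:
        """Host side of the "load FPGA" command: only two registers are written."""
        self.dma_address = address
        self.dma_size = size

    def run_dma(self, host_memory: bytes) -> None:
        """FPGA side: pull the image from host memory, then signal completion."""
        end = self.dma_address + self.dma_size
        self.bitstream = host_memory[self.dma_address:end]
        self.interrupts.append("config_done")


# 1024 bytes standing in for host memory 120; the image occupies bytes 128..639.
host_memory = bytes(range(256)) * 4
image_address, image_size = 128, 512

fpgas = [Fpga("fpga0"), Fpga("fpga1"), Fpga("fpga2")]

# The host issues one command per FPGA and is then finished.
for fpga in fpgas:
    fpga.load_fpga(image_address, image_size)

# Each FPGA fetches the image on its own (sequential here, parallel in hardware).
for fpga in fpgas:
    fpga.run_dma(host_memory)
```

In this sketch the host never touches the image bytes; its entire contribution is the pair of register values passed to `load_fpga`.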

Using a DMA engine, along with communication lines such as PCIe lines 130 which support communication of large bit streams, each FPGA 150 or primary FPGA 170 then accesses the host memory 120 and obtains the configuration bit image (or file) (which configuration also generally is loaded into the FPGA 150 or primary FPGA 170). By using the DMA engine, much larger files may be transferred quite rapidly, particularly compared to any packet- or word-based transmission (which would otherwise have to be assembled by the host CPU 110, a comparatively slow and labor-intensive task). This is generally performed in parallel (or serially, depending upon the capability of the host memory 120) for all of the FPGAs 150 or primary FPGAs 170. In turn, each primary FPGA 170 then transmits (typically over JTAG lines 145 or PCIe communication lines 130) the configuration bit image (or file) to each of the secondary FPGAs 160, also typically in parallel. Alternatively, each primary FPGA 170 may re-transmit (typically over JTAG lines 145 or PCIe communication lines 130) the information of the load FPGA message or command to each of the secondary FPGAs 160, namely the memory address in the host memory 120 and the file size, and each secondary FPGA 160 may read or otherwise obtain the configuration bit image, also using DMA engines, for example and without limitation. As another alternative, the host computing system 105 may transmit the load FPGA message or command to each of the FPGAs 150 or primary FPGAs 170 and secondary FPGAs 160, which then obtain the configuration bit image, also using DMA engines as described above. All such variations are within the scope of the disclosure.

By using communication lines such as PCIe lines 130 and JTAG lines 145 with the design of the system 100, 200, 300, 400, the configuration bit image is loaded quite rapidly not only into each of the FPGAs 150 and primary FPGAs 170 but also into each of the secondary FPGAs 160. This allows not only for an entire computing module 175 (or computing modules 180, 185, 195) to be reloaded in seconds, rather than hours, but the entire system 100, 200, 300, 400 may be configured and reconfigured in seconds, also rather than hours. As a result, read and write operations to local memory (e.g., nonvolatile memory 140) may be bypassed almost completely in the configuration process, resulting in a huge time savings. In selected embodiments, if desired but certainly not required, the configuration bit image (or file) may also be stored locally, such as in nonvolatile memory 140 (and/or nonvolatile memory 190 (e.g., FLASH) associated with computing modules 175, 180, 185, 195, 115).

As a result of this ultrafast loading of configurations, another significant advantage of the system 100, 200, 300, 400 is the corresponding capability, using the same process, for ultrafast reconfiguration of the entire system 100, 200, 300, 400. This is particularly helpful for the design, testing and optimization of system 100, 200, 300, 400 configurations for any given application, including various computationally intensive applications such as bioinformatics applications (e.g., gene sequencing).

FIG. 5 is a flow diagram illustrating an exemplary or representative method embodiment for system configuration and reconfiguration, and provides a useful summary of this process. Beginning with start step 240 and one or more FPGA 150, 160, 170 configurations (as configuration bit images) having been stored in a host memory 120, the system 100, 200, 300, 400 powers on or otherwise starts up, and the FPGAs 150, 160, 170 load the base communication functionality such as a PCIe configuration image (and possibly DMA functionality) from nonvolatile memory 140, step 245. Step 245 is optional, as such communication functionality also can be provided to FPGAs 150, 160, 170 via GPIO (or GP I/O) lines 131 (general purpose input and output lines), for example and without limitation. The host CPU 110 (or more generally, host computing system 105) then generates and transmits a “load FPGA” command or message to one or more FPGAs 150 or primary FPGAs 170 (and/or secondary FPGAs 160), step 250, in which the load FPGA command or message includes a starting memory address (in host memory 120) and a file size designation for the selected configuration bit image which is to be utilized. Using the DMA engines, and depending upon the selected variation (of any of the variations described above), the one or more FPGAs 150 or primary FPGAs 170 (and/or secondary FPGAs 160) obtain the configuration bit image from the host memory 120, step 255, and use it to configure. Also depending upon the selected embodiment, the one or more FPGAs 150 or primary FPGAs 170 may also transfer the configuration bit image to each of the secondary FPGAs 160, step 260, such as over JTAG lines 145 and bypassing nonvolatile memory 140, 190, which the secondary FPGAs 160 also use to configure. Also depending upon the selected embodiment, the configuration bit image may be stored locally, step 265, as a possible option as mentioned above. Having loaded the configuration bit image into the FPGAs 150, 160, 170, the method may end, return step 270, such as by generating an interrupt signal back to the host computing system 105.
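The ordering of the steps of FIG. 5 may be summarized in a short sketch. This is an illustrative model under stated assumptions, not the disclosed implementation: the function `configure`, the `topology` mapping of each primary FPGA to its secondary FPGAs, and the `store_locally` flag are hypothetical names, and only the variation in which each primary FPGA forwards the image to its secondaries is modeled. The numbers recorded in `steps` are the step numbers of FIG. 5.

```python
def configure(host_memory: bytes, address: int, size: int,
              topology: dict, store_locally: bool = False):
    """Return (image loaded per FPGA, local copies, ordered FIG. 5 step numbers).

    topology maps each primary FPGA name to a list of its secondary FPGA names
    (an assumption made for this sketch).
    """
    steps = [240, 245]          # start; base PCIe configuration from nonvolatile memory
    steps.append(250)           # host sends "load FPGA" with (address, size)

    image = host_memory[address:address + size]
    loaded = {}
    for primary in topology:    # step 255: each primary reads the image via DMA
        loaded[primary] = image
    steps.append(255)

    if any(topology.values()):  # step 260: forward to secondaries, e.g. over JTAG
        for primary, secondaries in topology.items():
            for secondary in secondaries:
                loaded[secondary] = loaded[primary]
        steps.append(260)

    local = {}
    if store_locally:           # step 265: optional local copy
        local = dict(loaded)
        steps.append(265)

    steps.append(270)           # return, e.g. interrupt back to the host
    return loaded, local, steps


example_topology = {"primary0": ["secondary0", "secondary1"], "primary1": []}
loaded, local, steps = configure(bytes(64), 16, 32, example_topology)
```

Step 265 appears in `steps` only when `store_locally` is true, matching its description as a possible option.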

The systems 100, 200, 300, 400 enable one of the significant features of the present disclosure, namely, the highly limited involvement of the host CPU 110 in data transfers between the host computing system 105 and any of the FPGAs 150, 160, 170, and their associated memories 190, and additionally, the highly limited involvement of the host CPU 110 in data transfers between and among any of the FPGAs 150, 160, 170, and their associated memories 190, all of which frees the host computing system 105 to be engaged in other tasks, and further is a significant departure from prior art systems. Once data transfer directions or routes have been established for a given or selected application within the systems 100, 200, 300, 400, moreover, these data communication paths are persistent for the duration of the application, continuing without any further involvement by the host computing system 105, which is also a sharp contrast with prior art systems.

Instead of a host CPU 110 “bit banging” or transferring a data file, including a very large data file, to each FPGA 150, 160, 170 or its associated memories 190, data transfers within the system 100, 200, 300, 400 occur rapidly and in parallel, and following setup of the DMA registers in the various FPGAs 150, 160, 170, largely without involvement of the host computing system 105. The data transfer paths are established by the host CPU 110 (or an FPGA 150, 160, 170 configured for this task) merely transmitting a message or command to one or more FPGAs 150, 160, 170 to set the base DMA registers within the FPGA 150, 160, 170 with a memory 190 address (or address or location in the host memory 120, as the case may be), optionally a file size of the data file, and a stream number, i.e., the host CPU 110 (or another FPGA 150, 160, 170 configured for this task) sets the DMA registers of the FPGA(s) 150, 160, 170 with the memory address (and optionally a file size) for the selected data file in the host memory 120 or in one of the memories 190, and also assigns a stream number, including a tie (or tied) stream number if applicable. Once this is established, the system 100, 200, 300, 400 is initialized for data transfer, and these assignments persist for the duration of the application, and do not need to be re-established for subsequent data transfers. It should be noted that the data to be transferred may originate from anywhere within a system 100, 200, 300, 400, including real-time generation by any of the FPGAs 150, 160, 170, any of the local memories, including memories 190, in addition to the host memory 120, and in addition to reception from an external source, for example and without limitation.

The host CPU 110 (or an FPGA 150, 160, 170 configured for this task) has therefore established the various data transfer paths between and among the host computing system 105 and the FPGAs 150, 160, 170 for the selected application. As data is then transferred throughout the system 100, 200, 300, 400, header information for any data transfer includes not only a system address (e.g., PCIe address) for the FPGA 150, 160, 170 and/or its associated memories 190, but also includes the “stream” designations (or information) and “tie (or tied) stream” designations (or information), which is particularly useful for multi-threaded or other parallel computation tasks. The header (e.g., a PCIe data packet header) for any selected data transfer path includes: (1) bits for an FPGA 150, 160, 170 and/or memory 190 address and optionally a file size; (2) additional bits for an assignment of a stream number to the data transfer (which stream number can be utilized repeatedly for additional data to be transferred subsequently for ongoing computations); and (3) additional bits for any “tie stream” designations, if any are utilized or needed. In addition, as each FPGA 150, 160, 170 may be coupled to a plurality of memories 190, each memory address typically also includes a designation of which memory 190 associated with the designated FPGA 150, 160, 170 is being addressed.

FIG. 6 is a block diagram illustrating exemplary or representative fields for a (stream) packet header 350, comprising a plurality of bits designating a first memory address (field 305) (typically a memory 190 address), a plurality of bits designating a file size (field 310) (as an optional field), a plurality of bits designating a (first) stream number (field 315), and as may be necessary or desirable, two additional and optional tie stream fields, namely, a plurality of bits designating the (second) memory 190 address for the tied stream (field 320) and a plurality of bits designating a tie (or tied) stream number (field 325).

Any application may then merely write to the selected stream number or read from the selected stream number for the selected memory 190 address (or FPGA 150, 160, 170 address), without any involvement by the host computing system 105, for as long as the application is running on the system 100, 200, 300, 400. In addition, for data transfer throughout the systems 100, 200, 300, 400, data transfer in one stream may be tied to a data transfer of another stream, allowing two separate processes to occur without involvement of the host computing system 105. The first “tie stream” process allows the “daisy chaining” of data transfers, so a data transfer to a first stream number for a selected memory 190 (or FPGA 150, 160, 170 process) on a first computing module 175, 180, 185, 195, 115 may be tied or chained to a subsequent transfer of the same data to another, second stream number for a selected memory 190 (or FPGA 150, 160, 170 process) on a second computing module 175, 180, 185, 195, 115, e.g., data transferred from the host computing system 105 or from a first memory 190 on a first computing module 175, 180, 185, 195, 115 (e.g., card “A”) (stream “1”) to a second memory 190 on a second computing module 175, 180, 185, 195, 115 (e.g., card “B”) will also be further transmitted from the second computing module 175, 180, 185, 195, 115 (e.g., card “B”) as a stream “2” to a third memory 190 on a third computing module 175, 180, 185, 195, 115 (e.g., card “C”), thereby tying streams 1 and 2, not only for the current data transfer, but for the entire duration of the application (until changed by the host computing system 105).

The second “tie stream” process allows the chaining or sequencing of data transfers between and among any of the FPGAs 150, 160, 170 without any involvement of the host computing system 105 after the initial setup of the DMA registers in the FPGAs 150, 160, 170. As a result, a data result output from a first stream number for a selected memory 190 (or FPGA 150, 160, 170 process) on a first computing module 175, 180, 185, 195, 115 may be tied or chained to be input data for another, second stream number for a selected memory 190 (or FPGA 150, 160, 170 process) on a second computing module 175, 180, 185, 195, 115, e.g., stream “3” data transferred from a first memory 190 on a first computing module 175, 180, 185, 195, 115 (e.g., card “A”) will be transferred as a stream “4” to a second memory 190 on a second computing module 175, 180, 185, 195, 115 (e.g., card “B”), thereby tying streams 3 and 4, not only for the current data transfer, but for the entire duration of the application (also until changed by the host computing system 105).
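The persistence of a tie between streams may be illustrated with a small simulation of the card “A”, “B” and “C” example. This is a sketch under assumptions: the class `Module` and its methods `tie` and `receive` are hypothetical names, each card's memory 190 is modeled as a dictionary keyed by stream number, and the tie is modeled as a table entry on the receiving card that is set once and then reused for every later transfer.

```python
class Module:
    """Toy model (assumption) of a computing module with one memory and a tie table."""

    def __init__(self, name: str):
        self.name = name
        self.memory = {}   # stream number -> list of payloads received on that stream
        self.ties = {}     # stream number -> (next module, next stream number)

    def tie(self, stream: int, next_module: "Module", next_stream: int) -> None:
        """One-time setup: data arriving on `stream` is also sent on as `next_stream`."""
        self.ties[stream] = (next_module, next_stream)

    def receive(self, stream: int, payload: bytes) -> None:
        """Store the payload, then forward it along any tie without host involvement."""
        self.memory.setdefault(stream, []).append(payload)
        if stream in self.ties:
            next_module, next_stream = self.ties[stream]
            next_module.receive(next_stream, payload)


card_b, card_c = Module("B"), Module("C")

# Setup performed once for the application: stream 1 on card B is tied to
# stream 2 on card C.
card_b.tie(1, card_c, 2)

# Two later transfers on stream 1 (e.g. from card A or the host); neither
# repeats the setup, and both are daisy-chained to card C as stream 2.
card_b.receive(1, b"block-0")
card_b.receive(1, b"block-1")
```

After the two transfers, card “B” holds both blocks under stream 1 and card “C” holds the same blocks under stream 2, although the tie was established only once.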

Any of these various data transfers may occur through any of the various communication channels of the systems 100, 200, 300, 400, and to and from any available internal or external resource, in addition to transmission over the PCIe network (PCIe switch 125 with PCIe communication lines 130), including through the non-blocking crossbar switch 220 (as an option) and over the JTAG lines 145 and/or GP I/O lines 131 and/or communication lines 210, depending upon the selected system 100, 200, 300, 400 configuration. All of these various mechanisms provide for several types of direct FPGA-to-FPGA communication, without any ongoing involvement by host computing system 105 once the DMA registers have been established. Stated another way, in the representative embodiments, the host CPU 110 is then literally done with the data transfer process, and from the perspective of the host computing system 105, following transmission of the DMA setup messages having a designation of a memory 190 address, a file size (as an option), and a stream number, the data transfer configuration process is complete. This is a huge advance over prior art methods of data transfer in supercomputing systems utilizing FPGAs.

Using a DMA engine, along with communication lines such as PCIe lines 130 which support communication of large bit streams, each FPGA 150, 160, 170 then accesses the host memory 120, or a memory 190, or any other data source, and obtains the data file for a read operation, or performs a corresponding write operation, all using the established address and stream number. By using the DMA engine, much larger files may be transferred quite rapidly, particularly compared to any packet- or word-based transmission. This is generally performed in parallel (or serially, depending upon the application) for all of the FPGAs 150, 160, 170.

By using communication lines such as PCIe lines 130 and JTAG lines 145 with the design of the system 100, 200, 300, 400, data transfer occurs quite rapidly, not only into each of the FPGAs 150 or primary FPGAs 170 but also into each of the secondary FPGAs 160, and their associated memories 190. As a result, resources, including memory 190, may be shared across the entire system 100, 200, 300, 400, with any FPGA 150, 160, 170 being able to access any resource anywhere in the system 100, 200, 300, 400, including any of the memories 190 on any of the computing modules or cards (modules) 175, 180, 185, 195, 115.

FIG. 7 is a flow diagram illustrating an exemplary or representative method embodiment for data transfer within a system 100, 200, 300, 400 and provides a useful summary. Beginning with start step 405, one or more DMA registers associated with any of the FPGAs 150, 160, 170 and their associated memories 190 are set up, step 410, with a memory (120, 190) address, a file size (as an option, and not necessarily required), a stream number, and any tie (or tied) stream number. Using the DMA engines for read and write operations, or using other available configurations within FPGAs 150, 160, 170, data is transferred between and among the FPGAs 150, 160, 170 using the designated addresses and stream numbers, step 415. When there are any tied streams, step 420, then the data is transferred to the next tied stream, step 425, as the case may be.

When there are additional data transfers, step 430, the method returns to step 415, and the process iterates. Otherwise, the method determines whether the application is complete, step 435, and if not, returns to step 415 and iterates as well. When the application is complete in step 435, and there is another application to be run, step 440, the method returns to step 410 to set up the DMA registers for the next application, and iterates. When there are no more applications to be run, the method may end, return step 445.
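The loop structure of FIG. 7 may be expressed as a brief control-flow sketch. It is illustrative only: the function `run_applications`, the register layout (a mapping from stream number to a dictionary with `"address"` and `"tie"` entries), and the returned log are hypothetical, and tie chains are assumed to be acyclic. The sketch shows the DMA registers being set up once per application (step 410), followed by any number of transfers (steps 415 and 430), each of which follows its tied streams (steps 420 and 425).

```python
from typing import Dict, List, Optional, Tuple

Registers = Dict[int, dict]      # stream number -> {"address": int, "tie": Optional[int]}
Transfer = Tuple[int, str]       # (stream number, payload)


def run_applications(applications: List[Tuple[Registers, List[Transfer]]]) -> list:
    """Return an ordered log of setup, transfer and tied-transfer events."""
    log = []
    for setup, transfers in applications:         # step 440: next application
        registers = dict(setup)                   # step 410: set up DMA registers once
        log.append(("setup", tuple(sorted(registers))))
        for stream, payload in transfers:         # steps 415 and 430: each transfer
            log.append(("transfer", stream, payload))
            tied: Optional[int] = registers[stream].get("tie")
            while tied is not None:               # steps 420 and 425: follow tied streams
                log.append(("tied", tied, payload))
                tied = registers[tied].get("tie")
    return log                                    # step 445: no more applications


app_registers = {
    1: {"address": 0x1000, "tie": 2},     # stream 1 is tied to stream 2
    2: {"address": 0x2000, "tie": None},
}
log = run_applications([(app_registers, [(1, "a"), (2, "b")])])
```

In the resulting log the single `"setup"` entry precedes both transfers, reflecting that the register assignments persist for the duration of the application.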

The present disclosure is to be considered as an exemplification of the principles of the invention and is not intended to limit the invention to the specific embodiments illustrated. In this respect, it is to be understood that the invention is not limited in its application to the details of construction and to the arrangements of components set forth above and below, illustrated in the drawings, or as described in the examples. Systems, methods and apparatuses consistent with the present invention are capable of other embodiments and of being practiced and carried out in various ways.

Although the invention has been described with respect to specific embodiments thereof, these embodiments are merely illustrative and not restrictive of the invention. In the description herein, numerous specific details are provided, such as examples of electronic components, electronic and structural connections, materials, and structural variations, to provide a thorough understanding of embodiments of the present invention. One skilled in the relevant art will recognize, however, that an embodiment of the invention can be practiced without one or more of the specific details, or with other apparatus, systems, assemblies, components, materials, parts, etc. In other instances, well-known structures, materials, or operations are not specifically shown or described in detail to avoid obscuring aspects of embodiments of the present invention. In addition, the various Figures are not drawn to scale and should not be regarded as limiting.

Reference throughout this specification to “one embodiment”, “an embodiment”, or a specific “embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention and not necessarily in all embodiments, and further, are not necessarily referring to the same embodiment. Furthermore, the particular features, structures, or characteristics of any specific embodiment of the present invention may be combined in any suitable manner and in any suitable combination with one or more other embodiments, including the use of selected features without corresponding use of other features. In addition, many modifications may be made to adapt a particular application, situation or material to the essential scope and spirit of the present invention. It is to be understood that other variations and modifications of the embodiments of the present invention described and illustrated herein are possible in light of the teachings herein and are to be considered part of the spirit and scope of the present invention.

It will also be appreciated that one or more of the elements depicted in the Figures can also be implemented in a more separate or integrated manner, or even removed or rendered inoperable in certain cases, as may be useful in accordance with a particular application. Integrally formed combinations of components are also within the scope of the invention, particularly for embodiments in which a separation or combination of discrete components is unclear or indiscernible. In addition, use of the term “coupled” herein, including in its various forms such as “coupling” or “couplable”, means and includes any direct or indirect electrical, structural or magnetic coupling, connection or attachment, or adaptation or capability for such a direct or indirect electrical, structural or magnetic coupling, connection or attachment, including integrally formed components and components which are coupled via or through another component.

A CPU or “processor” 110 may be any type of processor, and may be embodied as one or more processors 110, configured, designed, programmed or otherwise adapted to perform the functionality discussed herein. As the term processor is used herein, a processor 110 may include use of a single integrated circuit (“IC”), or may include use of a plurality of integrated circuits or other components connected, arranged or grouped together, such as controllers, microprocessors, digital signal processors (“DSPs”), parallel processors, multiple core processors, custom ICs, application specific integrated circuits (“ASICs”), field programmable gate arrays (“FPGAs”), adaptive computing ICs, associated memory (such as RAM, DRAM and ROM), and other ICs and components, whether analog or digital. As a consequence, as used herein, the term processor should be understood to equivalently mean and include a single IC, or arrangement of custom ICs, ASICs, processors, microprocessors, controllers, FPGAs, adaptive computing ICs, or some other grouping of integrated circuits which perform the functions discussed below, with associated memory, such as microprocessor memory or additional RAM, DRAM, SDRAM, SRAM, MRAM, ROM, FLASH, EPROM or E²PROM. A processor (such as processor 110), with its associated memory, may be adapted or configured (via programming, FPGA interconnection, or hard-wiring) to perform the methodology of the invention, as discussed above. For example, the methodology may be programmed and stored, in a processor 110 with its associated memory (and/or memory 120) and other equivalent components, as a set of program instructions or other code (or equivalent configuration or other program) for subsequent execution when the processor is operative (i.e., powered on and functioning). Equivalently, when the processor 110 may be implemented in whole or part as FPGAs, custom ICs and/or ASICs, the FPGAs, custom ICs or ASICs also may be designed, configured and/or hard-wired to implement the methodology of the invention. For example, the processor 110 may be implemented as an arrangement of analog and/or digital circuits, controllers, microprocessors, DSPs and/or ASICs, collectively referred to as a “processor”, which are respectively hard-wired, programmed, designed, adapted or configured to implement the methodology of the invention, including possibly in conjunction with a memory 120.

The memory 120, 140, 190, which may include a data repository (or database), may be embodied in any number of forms, including within any computer or other machine-readable data storage medium, memory device or other storage or communication device for storage or communication of information, currently known or which becomes available in the future, including, but not limited to, a memory integrated circuit (“IC”), or memory portion of an integrated circuit (such as the resident memory within a processor 110), whether volatile or non-volatile, whether removable or non-removable, including without limitation RAM, FLASH, DRAM, SDRAM, SRAM, MRAM, FeRAM, ROM, EPROM or E²PROM, or any other form of memory device, such as a magnetic hard drive, an optical drive, a magnetic disk or tape drive, a hard disk drive, other machine-readable storage or memory media such as a floppy disk, a CDROM, a CD-RW, digital versatile disk (DVD) or other optical memory, or any other type of memory, storage medium, or data storage apparatus or circuit, which is known or which becomes known, depending upon the selected embodiment. The memory 120, 140, 190 may be adapted to store various look up tables, parameters, coefficients, other information and data, programs or instructions (of the software of the present invention), and other types of tables such as database tables.

As indicated above, the processor 110 is hard-wired or programmed, using software and data structures of the invention, for example, to perform the methodology of the present invention. As a consequence, the system and method of the present invention may be embodied as software which provides such programming or other instructions, such as a set of instructions and/or metadata embodied within a non-transitory computer readable medium, discussed above. In addition, metadata may also be utilized to define the various data structures of a look up table or a database. Such software may be in the form of source or object code, by way of example and without limitation. Source code further may be compiled into some form of instructions or object code (including assembly language instructions or configuration information). The software, source code or metadata of the present invention may be embodied as any type of code, such as C, C++, SystemC, LISA, XML, Java, Brew, SQL and its variations (e.g., SQL 99 or proprietary versions of SQL), DB2, Oracle, or any other type of programming language which performs the functionality discussed herein, including various hardware definition or hardware modeling languages (e.g., Verilog, VHDL, RTL) and resulting database files (e.g., GDSII). As a consequence, a “construct”, “program construct”, “software construct” or “software”, as used equivalently herein, means and refers to any programming language, of any kind, with any syntax or signatures, which provides or can be interpreted to provide the associated functionality or methodology specified (when instantiated or loaded into a processor or computer and executed, including the processor 110, for example).

The software, metadata, or other source code of the present invention and any resulting bit file (object code, database, or look up table) may be embodied within any tangible, non-transitory storage medium, such as any of the computer or other machine-readable data storage media, as computer-readable instructions, data structures, program modules or other data, such as discussed above with respect to the memory 120, 140, 190, e.g., a floppy disk, a CDROM, a CD-RW, a DVD, a magnetic hard drive, an optical drive, or any other type of data storage apparatus or medium, as mentioned above.

Furthermore, any signal arrows in the drawings/Figures should be considered only exemplary, and not limiting, unless otherwise specifically noted. Combinations of components or steps will also be considered within the scope of the present invention, particularly where the ability to separate or combine is unclear or foreseeable. The disjunctive term “or”, as used herein and throughout the claims that follow, is generally intended to mean “and/or”, having both conjunctive and disjunctive meanings (and is not confined to an “exclusive or” meaning), unless otherwise indicated. As used in the description herein and throughout the claims that follow, “a”, “an”, and “the” include plural references unless the context clearly dictates otherwise. Also as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.

The foregoing description of illustrated embodiments of the present invention, including what is described in the summary or in the abstract, is not intended to be exhaustive or to limit the invention to the precise forms disclosed herein. From the foregoing, it will be observed that numerous variations, modifications and substitutions are intended and may be effected without departing from the spirit and scope of the novel concept of the invention. It is to be understood that no limitation with respect to the specific methods and apparatus illustrated herein is intended or should be inferred. It is, of course, intended to cover by the appended claims all such modifications as fall within the scope of the claims.

It is claimed:
 1. A computing system, the computing system comprising: aPCIe communication network comprising a PCIe switch and a plurality ofPCIe communication lines; a plurality of memory circuits; and aplurality of configurable logic circuits, each configurable logiccircuit of the plurality of configurable logic circuits coupled to thePCIe communication network and to at least one memory circuit of theplurality of memory circuits, each configurable logic circuit, of theplurality of configurable logic circuits, configured to generate a datapacket for each data transfer of a plurality of data transfers, the datapacket comprising a data transfer header having a designation of a firstmemory address and a stream number.
 2. The computing system of claim 1,wherein each configurable logic circuit of the plurality of configurablelogic circuits is further configured to generate a plurality of datatransfers to any other configurable logic circuit of the plurality ofconfigurable logic circuits.
 3. The computing system of claim 1, furthercomprising: at least one field programmable gate array configured as anon-blocking crossbar switch and coupled to the configurable logiccircuits; wherein each configurable logic circuit of the plurality ofconfigurable logic circuits is further configured to perform one or moredata transfers through the PCIe communication network or through thenonblocking crossbar switch, and wherein each configurable logic circuitof the plurality of configurable logic circuits is further configured toperform at least some data transfers of the plurality of data transfersin parallel and directly between or among the plurality of configurablelogic circuits.
 4. The computing system of claim 1, wherein eachconfigurable logic circuit of the plurality of configurable logiccircuits is further configured to include a file size designation in thedata transfer header.
 5. The computing system of claim 1, wherein thecomputing system further comprises one or more processor circuits, andwherein prior to commencement of a computing application to be performedby one or more configurable logic circuits of the plurality ofconfigurable logic circuits, the one or more processor circuits areadapted to transmit a plurality of DMA register messages to theconfigurable logic circuits of the plurality of configurable logiccircuits, each DMA register message having a plurality of designationscomprising a memory address of the plurality of memory circuits, a filesize, and the stream number.
 6. The computing system of claim 5, whereineach DMA register maintains the plurality of designations until anotherDMA register message changing the plurality of designations is received.7. The computing system of claim 1, wherein each configurable logiccircuit of the plurality of configurable logic circuits is furtherconfigured to include a designation of a second memory address and a tiestream number in one or more data transfer headers.
 8. The computing system of claim 7, wherein in response to receiving a data transfer including the designation of the second memory address and the tie stream number, each configurable logic circuit of the plurality of configurable logic circuits is further configured to forward the data transferred and the tie stream number to the second memory address.
 9. The computing system of claim 1, further comprising: a plurality of data communication lines coupling the plurality of configurable logic circuits in series, and wherein each configurable logic circuit of the plurality of configurable logic circuits is further configured to perform one or more data transfers through the plurality of data communication lines.
 10. A method of data transfer in a system, the system comprising a PCIe communication network, a plurality of memory circuits, and a plurality of configurable logic circuits, with each configurable logic circuit of the plurality of configurable logic circuits coupled to the PCIe communication network and to at least one memory circuit of the plurality of memory circuits, the method comprising: configuring each configurable logic circuit of the plurality of configurable logic circuits to generate a plurality of data transfers to any other configurable logic circuit of the plurality of configurable logic circuits; and configuring each configurable logic circuit of the plurality of configurable logic circuits to generate a data packet, and for each data transfer of the plurality of data transfers, the data packet comprising a data transfer header having a plurality of designations comprising a first memory address of the plurality of memory circuits and a stream number of a plurality of stream numbers.
 11. The method of claim 10, wherein the system further comprises one or more processor circuits, the method further comprising: using the one or more processor circuits, transmitting a plurality of DMA register messages to one or more configurable logic circuits of the plurality of configurable logic circuits, each DMA register message having a plurality of designations comprising a selected memory address of the plurality of memory circuits, a selected file size, and a selected stream number of the plurality of stream numbers.
 12. The method of claim 11, further comprising: maintaining the plurality of designations in each DMA register until another DMA register message changing the one or more designations of the plurality of designations is received by the one or more configurable logic circuits of the plurality of configurable logic circuits.
 13. The method of claim 10, further comprising: configuring each configurable logic circuit of the plurality of configurable logic circuits to include a file size designation in each data transfer header.
 14. The method of claim 10, further comprising: configuring each configurable logic circuit of the plurality of configurable logic circuits to include a designation of a second memory address and a tie stream number in one or more data transfer headers.
 15. The method of claim 14, further comprising: configuring each configurable logic circuit of the plurality of configurable logic circuits to, in response to receiving a data transfer including the designation of the second memory address and the tie stream number, forward the data transferred and the tie stream number to the second memory address.
 16. The method of claim 10, wherein at least some data transfers of the plurality of data transfers occur in parallel and directly between or among the plurality of configurable logic circuits.
 17. A system comprising: a PCIe communication network comprising a PCIe switch and a plurality of PCIe communication lines; a plurality of memory circuits; a plurality of configurable logic circuits, each configurable logic circuit of the plurality of configurable logic circuits coupled to the PCIe communication network and to at least one memory circuit of the plurality of memory circuits, each configurable logic circuit of the plurality of configurable logic circuits configured to generate a plurality of data transfers to and from any other configurable logic circuit of the plurality of configurable logic circuits, each configurable logic circuit of the plurality of configurable logic circuits further configured to generate a data packet for each data transfer of the plurality of data transfers, the data packet comprising a data transfer header having a plurality of designations comprising a first memory address and a stream number; and one or more processor circuits coupled to the PCIe communication network, wherein prior to commencement of a computing application to be performed by one or more configurable logic circuits of the plurality of configurable logic circuits, the one or more processor circuits are adapted to transmit a plurality of DMA register messages to the one or more configurable logic circuits of the plurality of configurable logic circuits, each DMA register message having a plurality of designations comprising a selected memory address of the plurality of memory circuits, a selected file size, and a selected stream number.
 18. The system of claim 17, wherein each data transfer header further comprises a file size designation, and wherein each DMA register maintains the plurality of designations until another DMA register message changing one or more designations of the plurality of designations is received.
 19. The system of claim 17, wherein configurable logic circuit of the plurality of configurable logic circuits is further configured to include a designation of a second memory address and a tie stream number in one or more data transfer headers.
 20. The system of claim 19, wherein in response to receiving a data transfer including the designation of the second memory address and the tie stream number, each configurable logic circuit of the plurality of configurable logic circuits is further configured to forward the data transferred and the tie stream number to the second memory address.
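
For illustration only, the data transfer header, the DMA register messages, and the tie stream forwarding recited in the claims above may be modeled in software along the following lines. This is a minimal sketch under stated assumptions, not the claimed hardware: the names `TransferHeader`, `DmaRegisterMessage`, `LogicCircuit` and `Fabric`, the address windows, and the dictionary used as memory are all hypothetical, and the routing step merely stands in for the PCIe communication network or crossbar switch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class TransferHeader:
    first_address: int                    # first memory address (destination)
    stream: int                           # stream number
    file_size: Optional[int] = None       # optional file size designation
    second_address: Optional[int] = None  # optional second memory address
    tie_stream: Optional[int] = None      # tie stream number used when forwarding


@dataclass(frozen=True)
class DmaRegisterMessage:
    address: int
    file_size: int
    stream: int


class LogicCircuit:
    """Hypothetical stand-in for one configurable logic circuit and its memory."""

    def __init__(self, name: str, base: int, size: int) -> None:
        self.name, self.base, self.size = name, base, size
        self.memory: Dict[Tuple[int, int], bytes] = {}  # (address, stream) -> data
        self.dma: Dict[int, DmaRegisterMessage] = {}     # stream -> designations

    def owns(self, address: int) -> bool:
        return self.base <= address < self.base + self.size

    def write_dma_register(self, msg: DmaRegisterMessage) -> None:
        # Designations persist until a later message for the same stream replaces them.
        self.dma[msg.stream] = msg


class Fabric:
    """Assumed routing layer: delivers a packet to the circuit owning the address."""

    def __init__(self, circuits: List[LogicCircuit]) -> None:
        self.circuits = circuits

    def send(self, header: TransferHeader, payload: bytes) -> None:
        target = next(c for c in self.circuits if c.owns(header.first_address))
        target.memory[(header.first_address, header.stream)] = payload
        if header.second_address is not None and header.tie_stream is not None:
            # Forward the data and the tie stream number to the second address.
            forwarded = TransferHeader(header.second_address, header.tie_stream,
                                       header.file_size)
            self.send(forwarded, payload)


# Example: two circuits with assumed, non-overlapping address windows.
a = LogicCircuit("fpga_a", 0x0000, 0x1000)
b = LogicCircuit("fpga_b", 0x1000, 0x1000)
fabric = Fabric([a, b])
a.write_dma_register(DmaRegisterMessage(address=0x0100, file_size=4, stream=1))
fabric.send(TransferHeader(0x0100, 1, 4, second_address=0x1200, tie_stream=7),
            b"data")
```

In this sketch the forwarded packet carries the tie stream number as its own stream number and has no second address, so forwarding stops after one hop; whether an actual embodiment chains further hops is not assumed here.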